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METHOD AND SYSTEM TO CONTROL OPERATION OF A 

PLAYBACK DEVICE 

CROSS-REFERENCE TO A RELATED APPLICATION 

[0001] This application claims the benefit of United States Provisional 

Patent Application entitled "Method and Apparatus to Control Operation of a 
Playback Device", Serial No.: 60/709,560, Filed 19 August 2005, the entire 
contents of which is herein incorporated by reference. 

TECHNICAL FfF.LT> 

[0002] This application relates to a method and apparatus to control 

operation of a playback device. In an embodiment, the method and apparatus 
may control playback, navigation, and/or dynamic playHsting of digital content 
using a speech interface. 



BACKGROUND 



[0003] Digital playback devices such as mobile telephones, portable 

media players (e.g., MP3 players), vehicle audio and navigation systems, or the 
like typically have physical controls that are utilized by a user to control 
operation of the device. For, example, functions such as "play", "pause", "stop" 
and the like provided on digital audio players are in the form of switches or 
buttons that a user activates in order to enable a selected function. A user 
typically will press a button (hard or soft) with a finger to select any given 
fimction. Further, commands that the devices may receive from a user are 
limited by the physical size of the user interface comprised of hard and soft 
physical switches. For example, road navigation products that incorporate 
speech input and audible feedback may have limited physical controls, display 
screen area, and graphical user interface sophistication that may not enable easy 
operation without speech input and/or speaker output. 
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BRIEF DESCRIPTTON OF DRAWINOS 

[0004] Some embodiments are illustrated by way of example and not 

limitation in the figures of the accompanying drawings in which: 

[0005] Figure 1 shows system architecture for playback control, 

navigation, and dynamic playlisting of digital content using a speech interface, 
in accordance with an example embodiment; 

[0006] Figure 2 is a block diagram of a media recognition and 

management system in accordance with an example embodiment; 

[0007] Figure 3 is a block diagram of a speech recognition and synthesis 

module in accordance with an example embodiment; 

[0008] Figure 4 is a block diagram of a media data structure in 

accordance with an example embodiment; 

[0009] Figure 5 is a block diagram of a track data structure in 

accordance with an example embodiment; 

[0010] Figure 6 is a block diagram of a navigation data structure in 

accordance with an example embodiment; 

[0011] Figure 7 is a block diagram of a text array data structure in 

accordance with an example embodiment; 

[0012] Figure 8 is a block diagram of a phonetic transcription data 

stmchire in accordance with an example embodiment; 

[0013] Figure 9 is a block diagram of an alternate phrase mapper data 

structure m accordance with an example embodiment; 

[0014] Figure 10 is a flowchart illustrating a method for managing 

phonetic metadata on a database according to an example embodiment; 

[0015] Figure 11 is a flowchart illustrating a method for altering 

phonetic metadata of a database according to an example embodiment; 

[0016] Figure 12 is a flowchart illustrating a method for using metadata 

with an application according to an example embodiment; 

[0017] Figure 13 is a flowchart illustrating a method for accessing and 

configuring metadata for an application according to an example embodiment; 
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[0018] Figure 14 is a flowchart illustrating a method for accessing and 

configuring media metadata according to an example embodiment; 
[00191 Figure 15 is a flowchart illustrating a method for processing a 

phrase received by voice recognition according to an example embodiment; 
[0020] Figure 16 is a flowchart illustrating a method for identifying a 

converted text string according to an example embodiment; 
[0021] Figure 17 is a flowchart illustrating a method for providing an 

output string by speech synthesis according to an example embodiment; 
[0022] Figure 18 is a flowchart illustrating a method for accessing a 

phonetic transcription for a string according to an example embodiment; 
[0023] Figure 19 is a flowchart illustrating a method for 

programmatically generating the phonetic transcription according to an example 
embodiment; 

[0024] Figure 20 is a flowchart illustrating a method for performing 

phoneme conversion according to an example embodiment; 
[0025] Figure 21 is a flowchart illustrating a method for converting a 

phonetic transcription into a target language according to an example 
embodiment; and 

[0026] Figure 22 illustrates a diagrammatic representation of an example 

machine in the form of a computer system within which a set of instructions, for 
causing the machine to perform any one or more of the methodologies discussed 
herein, may be executed. 

DETAILED DESCRIPTION 

[0027] An example method and apparatus to control operation of a 

playback device are described. For example, the method and apparatus may 
control playback, navigation, and/or dynamic playlisting of digital content using 
speech (or oral communication by a listener). In the following description, for 
purposes of explanation, numerous specific details are set forth in order to 
provide a thorough understanding of an embodiment of the present invention. It 
will be evident, however, to one skilled in the art that the present invention may 
be practiced without these specific details. Merely by way of example, the 
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digital content may be audio (e.g. music), still pictures/photographs, video (e.g., 
DVDs), or any other digital media. 

[0028] Although the invention is described by way of example with 

reference to digital audio, it will be appreciated to a person of skill in the art that 
it may be utiUzed to control the rendering or playback of any digital data or 
content. 

[0029] The example methods described herein may be implemented on 

many different types of systems. For example, one or more of the methods may 
be incorporated in a portable unit that plays recordings, or accessed by one or 
more servers processing requests received via a network (e.g., the Internet) from 
hundreds of devices each minute, or anything in between, such as a single 
desktop computer or a local area network. In an example embodiment, the 
method and apparatus may be deployed in portable or mobile media devices for 
the playback of digital media (e.g., vehicle audio systems, vehicle navigation 
systems, vehicle DVD players, portable hard drive based music players (e.g., 
MP3 players), mobile telephones or the like). The methods and apparatus 
described herein may be deployed as a stand alone device or fully integrated into 
a playback device (both portable and those devices more suitable to a fixed 
location (e.g., a home stereo system). 

[0030] An example embodiment allows flexibility in the type of data and 

associated voice commands and controls that can be delivered to a device or 
application. An example embodiment may deliver only the commands that the 
application rendering the audio requires. Accordingly, implementers deploying 
the method and apparatus in their existing products need only use the generated 
data they need and that their particular products require to perform the requisite 
functionality (e.g., vehicle audio system or application running on such a system, 
MPS player and application software running on the player, or the like). In an 
example embodiment, the apparatus and method may operate in conjunction 
with a legacy automated speech recognition (ASR)/ text-to-speech (TTS) 
solution and existing application features to accomplish accurate speech 
recognition and synthesis of music metadata. 
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[0031] When used Avith advanced ASR and/or TTS technology, the 

apparatus may enable device manufacturers to quickly enable hands-free access 
to music collections in all types of digital entertainment devices (e.g., vehicle 
audio systems, navigation systems, mobile telephones, or the like). 

[0032] Pronimciations used for media management may pose special 

challenges for ASR and TTS systems, hi an example embodiment, 
accommodatmg music domain specific data may be accomplished with a modest 
increase in database size. The augmentation may largely stem from the phonetic 
transcriptions for artist, album, and song names, as well as other media domain 
specific terms, such as genres, styles, and the like. 

[0033] An example embodiment provides functions and delivery of 

phonetic data to a device or application in order to facilitate a variety of ASR 
and TTS features. These functions can be used in conjunction with various 
devices, as mentioned by way of example above, and a media database, hi an 
example embodiment, the media database can be accessed remotely for systems 
with online access or via a local database (e.g., an embedded local database) for 
non-persistently connected devices. Thus, for example, the local database may 
be provided in a hard disk drive (HDD) of a portable playback device. 

[0034] hi an example embodiment, additional secure content and data 

maybe embedded in a local hard disk drive or in an online repository that can be 
accessed via the appropriate voice commands along with a Digital Rights 
Management (DRM) action. For example, a user may verbally request to 
purchase a track for which access may then be unlocked. The license key and/or 
the actual track may then be locally unlocked, streamed to the user, downloaded 
to the user's device or the like. 

[0035] In an example embodiment, the method and apparatus may work 

in conjunction with supporting data structures such as genre hierarchies, era/year 
hierarchies, and origm hierarchies as well as relational data such as related 
artists, albums, and gem-es. Regional or device-specific hierarchies maybe 
loaded in so that the supported voice commands are consistent with user 
expectations of the target market, hi addition, the method and apparatus may be 
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configured tor one or more specific languages. 

[0036] Figure 1 shows an example high level system architecture 100 

for recognition of media content to enable playback control, navigation, media 
content search, media content recommendations, reading and/or delivering of 
enhanced metadata (e.g., lyrics and cover art) and/or dynamic playlisting of the 
media content. The architecture 100 may include a speech recognition and 
synthesis apparatus 104 in communication with a media management system 
106 and an application layer/user interface (UI) 108. 

[0037] The speech recognition and synthesis apparatus 1 04 may receive 

spoken input 116 and provide speaker output 114 through speech recognition 
and speech synthesis respectively. For example, playback control, navigation, 
media content search, media content recommendations, reading and/or 
delivering of enhanced metadata (e.g., lyrics and cover art) and/or dynamic 
playlisting of media content using a text-to-speech (TTS) engine 1 10 for speech 
synthesis and an automated speech recognition (ASR) engine 1 12 for speech 
recognition commands may allow, for example, navigation functionality (e.g., 
browse content on a playback device) based on the delivered phonetic metadata 
128. 

[0038] A user may provide the spoken input 1 1 6 via an input device 

(e.g., a microphone) which is then fed into the ASR engine 1 12. An output of 
the ASR engine 1 12 is fed into the application layerAJI 108 which may 
communicate with the media management system 1 06 that includes a playlist 
application layer 122, a voice operation commands (VOCs) layer 124, a link 
application layer 132, and a media identification (ID) appHcation layer 134. The 
media management system 106, in turn, may communicate with a media 
database (e.g., of local or online CDs) 126 and a playlisting database 110. 

[0039] hi an example embodiment, the media ID application layer 134 

maybe used to perform a recognition process of media content 136 stored in a 
local library database 1 18 by use of proper identification methods (e.g., text 
matching, audio and/or video fingerprints, compact disc Table of Contents TOC, 
or DVD Table of Programming ) in order to persistently associate the media 
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meiaaata liUAvith the related media content. 136 

[0040] The application layer/user interface 1 08 may process 

communications received from a user and/or an embedded application (e.g., 
within the playback device), while a media player 102 may receive and/or 
provide textual and/or graphical communications between a user and the 
embedded application. 

[0041] In an example embodiment, the media player 1 02 may be a 

combination of software and/or hardware and may include one or more of the 
following: a controls, a port (e.g., universal serial port), a display, a storage, a 
CD player, a DVD player, an audio file, a storage (e.g., removable, and/or 
fixed), streamed content (e.g., FM radio and satellite radio), recording capabiUty, 
and other media. In an example embodiment, the embedded application may 
interface with the media player 102, such that the embedded appUcation may 
have access to and/or control of fiinctionality of the media player 102. 

[0042] In an example embodiment, support for phonetic metadata 128 

may be provided m media-ID application layer 134 by including the phonetic 
metadata 128 in a media data structure. For example, when a CD lookup is 
successful and the media metadata 130 (e.g., album data) is returned, all 
phonetic metadata 128 may automatically be included within the media data 
structure. 

[0043] The playHst application layer 122 may enable the creation and/or 

management of playHsts within the playlisting database 110. For example, the 
playlists may include media content as may be contained with the media 
database 126. 

[0044] As illustrated, the media database 1 26 may include the media 

metadata 1 30 that may be enhanced to include the phonetic metadata 128. hi an 
example embodiment, an editorial process may be utilized to provide broad- 
coverage phonetic metadata 128 to account for any insufficiencies in existing 
speech recognition and/or speech synthesis systems. For example, by explicitly 
associating specifically generated phonetic data 128 directly with media 
metadata 130, the association may assist existing speech recognition and/or 



7 



wo 2007/022533 



PCT/US2006/032722 



speecn synthesis systems that cannot effectively process media metadata 130, 
such as artist, album, and track names, which are not pronounced easily, 
mispronounced, have nicknames, or not pronounced as they are spelled. 

[0045] In an example embodiment, the media metadata 130 may include 

metadata for playback control, navigation, media content search, media content 
recommendations, reading and/or deUvering of enhanced metadata (e.g., lyrics 
and cover art) and/or dynamic playlisting of media content. 

[0046] The phonetic metadata 1 28 may be used by the speech 

recognition and synthesis apparatus 104 to enable functions to work in 
conjunction with the other components of a solution and may be used in devices 
without a persistent Internet connection, devices with an Internet connection, PC 
applications, and the like. 

[0047] In an example embodiment, one or more phonetic dictionaries 

derived from the phonetic metadata 128 of the media database 126 and may be 
created in part or as a whole in clear-text form or another format. Once 
completed, the phonetic dictionaries may be provided by the embedded 
application for use with the speech recognition and synthesis apparatus 104, or 
appended to existing dictionaries already used by the speech recognition and 
synthesis apparatus 104. 

[0048] hi an example embodiment, multiple dictionaries may be created 

by the media management system 106. For example, a contributor (artist) 
phonetic dictionary and a geme phonetic dictionary may be created for use by 
the speech recognition and synthesis apparatus 104. 

[0049] Referring to Figure 2, an example media recognition and 

management system 200 is illustrated, hi an example embodiment, the media 
recognition and management system 106 (see Figure 1) may include the media 
recognition and management system 200. 

[0050] The media recognition and management system 200 may include 

a platform 202 that is coupled to an operating system (OS) 204. The platform 
202 may be a framework, either m hardware and/or software, which enables 
software to run. The operating system 204 may be in communication with a data 
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communication 206 and may further communicate with an OS abstraction layer 
208. 

[0051] The OS abstraction layer 208 may be in communication with a 

media database 210, an updates database 212, a cache 214, and a metadata local 
database 216. The media database 210 may include one or more media items 
218 (e.g., CDs, digital audio tracks, DVDs, movies, photographs, and the like), 
which may then be associated with media metadata 220 and phonetic metadata 
222, In an example embodiment, a sufficiently robust reference fingerprint set 
may be generated to identify modified copies of an original recording based on a 
fingerprint of the original recording (reference recording). 

[0052] In an example embodiment, the cache 214 may be local storage 

on a computing system or device used to store data, and may be used in the 
media recognition and management system 200 to provide file-based caching 
mechanisms to aid in storing recently queried results that may speed up future 
queries. 

[0053] Playlist-related data for media items 21 8 in a user's collection 

may be stored in a metadata local database 216. In an example embodiment, the 
metadata local database 216 may include the playlisting database 110 (see 
Figure 1). The metadata local database 216 may include all the infomiation 
needed during execution of a playhst creation 232 at direction of a playlist 
manager 230 to create playlist results sets. The playlisting creation 232 may be 
interfaced through a playlist application programming interface (API) 236. 

[0054] Lookups in the media recognition and management system 200 

maybe enabled through communication between the OS abstraction layer 208 
and a lookup server 222. The lookup server 222 may be in communication with 
an update manager 228, an encryption/decryption module 224 and a 
compression module 226 to effectuate the lookups. 

[0055] The media recognition module 246 may communicate with the 

update manager 228 and the lookup server 222 and be used to recognize media, 
such as by accessing media metadata 220 associated with the media items 218 
from the media database 210. In an embodiment. Compact Disks (audio CDs) 
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ana/or omer media items 21 8 can be recognized (or identified) by using Table of 
Contents (TOG) information or audio fingerprints. Once the TOC or the 
fingerprint is available, an application or a device can then look up the media 
Item 21 8 for the CD or other media content to retrieve the media metadata 220 
firom the media database 210. If the phonetic data 222 exists for the recognized 
media items 218, it may be made available in a phonetic transcription language 
such as X-SAMPA. The media database 210 may reside locally or be accessible 
over a network connection, hi an example embodiment, a phonetic franscription 
language may be a character set designed for accurate phonetic transcription (the 
representation of speech sounds with text symbols), hi an example embodiment, 
Extended Speech Assessment Methods Phonetic Alphabet (X-SAMPA) may be 
a phonetic transcription language designed to accurately model the International 
Phonetic Alphabet in ASCII characters. 

[0056] A content IDs delivery module 224 may deliver identification of 

content directly to a Hnk API 238, while a VOCs API 242 may communicate 
with the recognition media module 226 and a media-ID API 240. 

[0057] Referring to Figure 3, an example speech recognition and 

synthesis apparatus 300 for controlling operation of a playback device is 
illustrated, hi an example embodunent, the speech recognition and synthesis 
apparatus 104 (see Figure 1) may include the speech recognition and synthesis 
apparatus 300. The speech recognition and synthesis apparatus 300 may include 
an ASR/TTS system. 

[0058] ASR engine 1 1 2 may include speech recognition modules 3 1 4, 

3 1 6, 3 1 8, 320, which may know all commands supported by the media 
management system 106 as well as all media metadata 130, and upon 
recognition of a command the speech recognition engine 1 12 may send an 
appropriate command to a relevant handler (see Figure 1). For example, if a 
playHstmg appHcation is associated with the embodiment, the ASR engine 112 
may send an appropriate command to the playlisting application and then to the 
application layer/UI 108 (see Figure 1), which may then execute the request. 

[0059] Once the speech recognition and synthesis apparatus 300 has 
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oeen conngurea wim me appropriate data (e.g., phonetic metadata 128, 222 
customized for the music domain) the speech recognition and synthesis 
apparatus 300 may then be ready to respond to voice commands that are 
associated with the particular domain to which it has been configured. The 
phonetic metadata 128 may also be associated with the particular device on 
which it is resident. For example, if the device is a playback device, the phonetic 
data may be customized to accommodate commands such as "play," "play 
again," "stop," "pause," etc. 

[0060] The TTS engine 1 1 0 (see Figure 1) may include the speech 

synthesis modules 306, 308, 310, 312. Upon receiving a speech synthesis 
request, a client application may send the command to be spoken to the TTS 
engine 110. The speech synthesis modules 306, 308, 310, 312 may first look up 
a text string to be spoken in its associated dictionary or dictionaries. This 
phonetic representation of the text string that it finds in the dictionary may then 
be taken by the TTS engine 306 and the phonetic representation of the text string 
may be spoken (e.g., create a speaker output 302 of the text string). 

[00611 In an example embodiment, ASR grammar 3 1 8 may include a 

dictionary including all phonetic metadata 128, 222 and commands. It is here 
that commands such as "Play Artist," "More like this," "What is this," may be 
defined. 

[0062] In an example embodiment, the TTS dictionary 3 1 0 may be a 

binary or text TTS dictionary that includes all pre-defined pronunciations. For 
example, the TTS dictionary 3 10 may include all phonetic metadata 128, 222 
from the media database for the recognized content in the application database. 
The TTS dictionary 310 need not necessarily hold all possible words or phrases 
the TTS system could pronounce, as words not in this dictionary may be handled 
via G2P. 

[0063] After content recognition and an update of speech recognition and 

synthesis apparatus 300 fiinctionality has been performed, the user may be able 
to execute commands for speech recognition and/or speech synthesis. It will 
however be appreciated that the fiinctionality may be performed in other 
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appropnaie ways ana is not restricted to the description above. For example, a 
playback device may be preloaded with appropriate phonetic metadata 128, 222 
suitable for the music domain and which may, for example, be updated via the 
Internet or any other communication channel. 

[0064] In an example embodiment in which the speech recognition and 

synthesis apparatus 300 supports X-SAMPA, the phonetic metadata 128, 222 
may be provided as is. However, in embodiments in which the speech 
recognition and synthesis apparatus 300 seeks data in a different phonetic 
language, the apparatus 300 may include a character map to convert fix)m X- 
SAMPA to a selected phonetic language. 

[0065] The speech recognition and synthesis apparatus 3 00 may, for 

example, control a playback device in accordance as follows: A spoken input 
304 may be a command that is spoken (e.g., an oral conununication by a user) 
into an audio input (e.g., a microphone), such that when a user speaks the 
command, the associated speech may go into the ASR engine 314. Here, 
phonetic features such as pitch and tone may be extracted to generate a digital 
readout of the user's utterance. After this stage, the ASR engine 314 may send 
features to the search part of the speech recognition and synthesis apparatus 300 
for recognition. In a search stage, the ASR engine 3 14 may match the features it 
has extracted from the spoken command against the actual commands in its 
compiled grammar (e.g., a database of reference commands). The grammar may 
include phonetic data 128, 222 specific to a particular embodiment. The ASR 
engine 314 may use an acoustic model as a guide for average characteristics of 
speech for a given or selected language, allowing the matching of phonetic 
metadata 128, 222 with speech. Here, the ASR engine 3 14 may either return a 
matching command or a "fail" message. 

[0066] In an example embodiment, user profiles may be utilized to frain 

the speech recognition and synthesis apparatus 300 to better understand the 
spoken commands of a given individual so as to provide a higher rate of 
accuracy (e.g., a higher rate of accuracy in recognizing domain specific 
commands). This may be achieved by the user speaking a specific set of text 
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Strings into the speech recognition and synthesis apparatus 300, which are pre- 
defined and provided by the ASR system developer. For example, the text 
strings may be specific to the music domain. 

[0067] Once a matching command has been found, the ASR engine 314 

may produce a result and send a command to an embedded application. The 
embedded application can then execute the command. 

[0068] The TTS engine 306 may take a text (or phonetic) string and 

process into it into speech. The TTS engine 306 may receive a text command 
and, for example, using either G2P software or by searching a precompiled 
binary dictionary (equipped with provided phonetic metadata 128, 222), the TTS 
engine 306 may process the string. It will be appreciated the TTS functionality 
may also be customized to a specific domain (e.g., the music domain). The TTS 
result may "speak" the string (create a speaker output 302 corresponding to the 
text). 

[0069] In an example embodiment, along with the metadata, a list of 

typical voice command and control fimctions may also provided. These voice 
commands and control fimctions may be added to the default grammar for 
recompilation at runtime, at initialization or during development. A list of 
example command and control fiinctions (Supported Functions) is provided 
below. 

[0070] In an embodiment, while a grammar may be used and updated for 

speech recognition, a binary or a text dictionary may be needed for speech 
synthesis. Any text string may be passed to the TTS engine 306, which may 
speak the string using G2P and the pronunciations provided for it by the TTS 
dictionary 310. 

[0071] In an example embodiment, the speech recognition and synthesis 

apparatus 300 may support Grapheme to Phoneme (G2P) conversion, which may 
dynamically and automatically convert a display text into its associated phonetic 
transcription through a G2P module(s). G2P technology may take as input a 
plain text string provided by application and generate an automatic phonetic 
transcription. 
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1U072J Users may, for exaraple, control basic playback of music content 

via voice using ASR technology within an embedded device or with bundled 
products for the device that include recognition, management, navigation, 
playlisting, search, recommendation and/or linking to third party technology. 
Users may navigate and select specific artists, albums, and songs using speech 
commands. 

[0073] For example, using the speech recognition and synthesis 

apparatus 300, users may dynamically create automatic playlists using multiple 
criteria such as genre, era, year, region, artist type, tempo, beats per minute, 
mood, etc., or can generate seed-based automatic playlists with a simple spoken 
command to create a playlist of similar music. In an example embodiment, all 
basic playback commands (e.g., "Play," "Next," "Back," etc.) may be performed 
via voice commands. In addition, text-to-speech may also provide with 
commands like "More like this" or "What is this?" or any other domain specific 
commands. It will thus be appreciated that the speech recognition and synthesis 
apparatus 300 may facilitate and enhance the type and scope of commands that 
may be provided to a playback device such as an audio playback device by using 
voice commands. 

[0074] A table including examples of example voice commands that may 

be supported by the apparatus is shown below. 



Function 



Example 



Command 



sasMiiMiiiiiaiiiiii^ 



Basic Controls 

Play 

Stop 

Skip Track 
Prior Track 
Pause 

Repeat Track 



Tlay" 
"Stop" 
"Next" 
"Back" 
"Pause" 

"Repeat / Play It Again" 



Play 

Stop 

Next 

Back 

Pause 

Repeat 
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Content Item Playback 

Track Play 
Album Play 



"Play Songrrrack" <Summer In the Ctty> 
"Play Album" <Ex(le on Main Street> 



Play Song 
Play Album 



Disambiguation 

Play Other Artist^Album/Song/Etc. 



"Play Other <Nirvana>" 



Play Other 



Identify Content (wHTS of textual content) 



Identify Song and Artist 
Identify Artist 
Identify Album 
identify Song 
Identify Genre 
Identify Year 
Transcribe Lyric Line 

Custom Metadata Labeling 

Add Artist Nickname 
Add Album Nickname 
Add Song Nickname 
Add Alternate Command 
Add Song Nickname 

Set System Preferences 

Set preference how to announce all 
Artists 

Set preference how to announce all 
Albums 

Set preference how to announce all 
Tracks 

Set preference how to announce 
specific Artists 

Set preference how to announce 
specific Albums 

Set preference how to announce 
specific Tracks 



"What is This?" 
"Artist Name?" 
"Album Name?" 
"Song Name?" 
"Genre Name?" 
"What Year is This?" 
"What'd He Say?" 



"This Artist Nickname <Beck>" 
"This Album Nickname <Mellow Gold>" 
This Song Nickname <Pay No Mind>" 
Command <This Sucksf> Means <Rating 0>" 
"This Song Nickname <Pay No Mind>" 



"Use <Nicknames> for all <artists>" 



"Use <Nicknames> for all <albums>" 



"Use <Nicknames> for all <tracks>" 



"Use <Nicknames> for this <artist>'' 



"Use <Nicknames> for this <album>'* 



''Use <Nicknames> for this <track>" 



What is This? 
Artist Name? 
Album Name? 
Song Name? 
Genre Name? 
What Year is This?" 
What Did He Say? 



This Artist Nickname 
This Album Nickname 
This Song Nickname 
Command - Means 
This Song Nickname 



Use - for all 



Use - for all 



Use - for all 



Use - for this 



Use - for this 



Use - for this 
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PLAYLI5TING 



Static Playlists 

New Playlist 
Add to Playlist 
Delete From Playlist 



"New Playlist" <Our Parisian Adventure> New Playlist 

"Add to" <Our Parisian Adventure> Add to 

"Delete From"" <Our Parisian Adventure> Delete From 



Single-Factual Criterion Auto-Playlist 

Artist Play 
Composer Play 
Year Play 



"Play Artist"* <Becl<> 

Tlay Composer* <Stravinsky> 

"Play Year"<1996> 



Play Artist 
Play Composer 
Play Year 



Single-Descriptive Criterion Auto-Playllsts 



Genre Play 
Era Play 
Artist Type Play 
Region Play 

Play in Release Date Order 

Play Earliest Release Date Content 



"Play Genre/Style" <Big Band> 

Tlay Era/Decade" <80's> 

"Play Artist Type" <Female Solo> 

"Play Region' <Jamaica> 

"Play< Bob Dylan> in >Release Date> Order 

"Play Early <Beafles> 



Play Genre 
Play Era 
Play Artist Type 
Play Region 
Play In Order 
Play Early 



IntelliMix and IntelliMix Focus Variations 



Track IntelliMix 
Album IntelliMix 
Artist IntelliMix 
Genre IntelliMix 
Region IntelliMix 



"More Like This" 
"More Like This Album" 
"More Like This Artist" 
"More Like This Genre" 
"More Like This Region" 



More Like This 
More Like This Album 
More Like This Artist 
More Like This Genre 
More Like This Region 



"Play The Rest" 
More from Album 
More from Artist 
More from Genre 



"Play This Album" 
"Play This Artisf 
"Play This Genre" 



Play this album 
Play this artist 
Play this genre 



Edit / Adjust Current Auto-PIaylist 

Play Older Songs 
Play More Popular 



"Older" 

"More Popular" 



Older 

More Popular 
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Define/Generate & Play New Auto-Playlist 



Decade/Genre Auto PL 
Origin/Genre Auto PL 
Type/Genre Auto PL 



"New Mix" <70's Funk> 

"New Mix" < French EIectronlca> 

"New Mix" <Femate Singer-Songwriters> 



New Mix 
New Mix 
New Mix 



Save Auto-Playlist Definition 
Save User-Defined AutoPL 
Save Auto-PL Results as Fixed PL 



"Save Mix As" <Darcy's Party Mlx>" 
"Save Playlist As" <Darcy's Party Mix>- 



Save Mix As 
Save Pfaylist As 



Re-MIx / Play Saved Auto-Playlist Definition 

Play User-Defined AutoPL Tlay Mix" <Darcy's Party Mlx>" 

Play Preset AutoPL "Play Mix" <Rock On, Dude> 



Play Mix 
Play Mix 



Explicit Rating 
Rate Track 
Rate Album 
Rate Artist 
Rate Year 
Rate Region 

Change User Profile 

Change User 

Add User (for combo profiles) 

Descriptor Assignment 

Edit Artist Descriptor 

Edit Album Descriptor 

Edit Song Descriptor 

Assign Artist Similarity 

Assign Album Similarity 

Assign Song Similarity 

Create User Defined Playlist Criteria 

Assign User-Defined PL Criteria 

Banishing 

Banish Track from all Playback 
Banish Album from all Auto-PLs 



"Rating 9" 
"Rate Album 7" 
"Rate Artist 0 " 
"Rate Year 10" 
"Rate Region 4" 



"Sign in <Samanfha>" 
"Also Sign In <Evan>" 



Tills Artist Origin <Brazil>" 
This Album Era <50's>" 
This Song Genre <Ragtime>" 
This Artist Similar <Nicf< Draf<e>'* 
"This Album Similar <Bryter Layter>" 
This Song Similar <Ceilo Songy 
"Create Tag <Radicaii>*' 
Tag <Radicall>" 



Rating 
Rate Album 
Rate Artist 
Rate Year 
Rate Region 



Sign in 
Also Sign In 



This Artist Origin 
This Album ERa 
This Song Genre 
This Artist Similar 
This Album Similar 
This Song Similar 
Create Tag 
Tag 



'Never Again" 
"Banish Album" 



Never Again 
Banish 
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Banish Artist from Specific AutoPL 



"Banish Artist from Mix" 



Banish from Mix 



3"* PARTY CONTENT LINKING 



Related Content Request 

Hear Review 
Hear Bio 

Hear Concert Info 



"Review" 
"Bio- 
Tour" 



Review 

Bio 

Tour 



Commerce 
Dovmload Track 
Download Album 
Buy Ticket 



"Download Track" 
"Download Album" 
"Buy Ticket" 



Download Track 
Download Album 
Buy Ticket 



NAVIGATION 



Multi-Source (e.g. Local files, DIgHal AM/FM. Satellite Radio, Internet Radio) Search 

Inter-Source Artist Nav "Find Artist <Frank Sinatra>" 

Inter-Source Genre Nav "Find Genre <Reggae>'' 



Find Artist 
Find Genre 



Similar Content Browsing 

Similar Artist Browse 
Similar Genre Browse 
Similar Playlist Browse 



"Find Similar Artists" 
"Find Similar Genres" 
"Find Similar Playlists" 



Find Similar Artists 
Find Similar Genres 
Find Similar Piaytists 



Browsing via TTS Category Name 

Genre Hierarchy Nav 
Era Hierarchy Nav 
Origin Hierarchy Nav 
Era / Genre Hierarchy Nav 
Browse Parent Category 
Browse Child Category 
Pre-Set Playlist Nav 
Auto-Playllst Nav 
Auto-Playlist Category Nav 
Similar Origin Nav 
Similar Artists Nav 



Listing 

"Browse <Jazz> <Albums>'' 
"Browse <60's> <Tracks>" 
"Browse <Africa> <Artists>" 
"Browse <40's> <Jaz2> <Artlsts>" 
"Up Level" 
"Down Level" 
"Browse Pre-Sets" 
"Browse Playlists" 
"Browse Driving Playlists" 
"Browse Similar Regions" 
"Browse Simitar Artists" 



Browse 

Browse 

Browse 

Browse 

Up Level 

Dovm Level 

Browse 

Browse 

Browne 

Browse 

Browse 
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Browsing via 4-Second Audio Preview Listing 



Genre Track Clip Scan 
Artist Track Clip Scan 
Origin Track Clip Scan 
Pre-Set AutoPL Clip Scan 
Similar Tracks Scan 



"Scan Motown* 
"Scan Pink Floyd" 
"Scan Italy" 

"Scan Pre-Set <Sunday Moming>'' 
"Scan Similar Tracks" 



Scan 
Scan 
Scan 
Scan 
Scan 



RECOIVIENDATIONS 



Track Recommendations 
Album Recommendations 
Artist Recommendations 



Suggest More Tracks 
Suggest More Albums 
Suggest More Artists 



Suggest More Tracks 
Suggest More Albums 
Suggest More Artists 



Table 1: Example Voice Commands 

[0075] Referring to Figure 4, an example media data structure 400 is 

illustrated. In an example embodiment, the media data structure 400 may be 
used to represent media metadata 130, 220 for media content, such as for the 
media items 218 (see Figures 1 and 2). The media data structure 400 may 
include a first field with a media title array 402, a second field with a primary 
artist array 404, and a third field with a track array 406. 

[0076J The media title array 402 may include an official representation 

and one or more alternate representations of a media title (e.g., a title of an 
album, a title of a movie, and a title of a television show). The primary artist 
name array 404 may include an official representation and one or more alternate 
representations of a primary artist name (e.g., a name of a band, a name of a 
production company, and a name of a primary actor). The track array 406 may 
include one or more tracks (e.g., digital audio tracks of an album, episodes of a 
television show, and scenes in a movie) for the media title. 

[0077] By way of an example, the media title array 402 may include 

"Led ZeppeHn IV", "Zoso", and "Untitled", the primary artist name array 404 
may include "Led Zeppelin" and "The New Yardbirds", and the track array 406 
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may include "Black Dog", "Rock and Roll", "The Battle of Evermore", 
"Stairway to Heaven", "Misty Mountain Hop", "Four Sticks", "Going to 
California", and "When the Levee Breaks". 

[0078] In an example embodiment, the media data structure 400 may be 

retrieved through a successful lookup event, either online or local. For example, 
media-based lookups (e.g., CD-based lookups and DVD-based lookups) may 
return media data structures 400 that provide information for every track on a 
media item, while a file-based lookup may return the media data structure 400 
that provides information only for a recognized track. 

[0079] Referring to Figure 5, an example track data structure 500 is 

illustrated. In an example embodiment, each element of the track array 406 (see 
Figure 4) may include the track data structure 500. 

[0080] The track data structure 500 may include a first field with a track 

title array 502 and a second field with a track primary artist name array 504. 
The track title array 502 may include an ofBcial representation and one or more 
alternate representations of a track title. The track primary artist name array 504 
may include an official representation and one or more altemate representations 
of a primary artist name of the track. 

[0081] Referring to Figure 6, an example command data structure 600 is 

illustrated. The conunand data structure 600 may include a first field with a 
command array 602 and a second field with a provider name array 604. In an 
example embodiment, the command data structure 600 may be used for voice 
conunands used with the speech recognition and synthesis apparatus 300 (see 
Figure 3). 

[0082] The command array 602 may include an official representation 

and one or more altemate representations of a command (e.g., navigation control 
and control over a playKst). The provider name array 604 may include an 
ofBcial representation and one or more altemate representations of a provider of 
the command. For example, the command may enable navigation, playlisting 
(e.g., the creation and/or use of one or more play lists of music), play control 
(e.g., play and stop), and the like. 
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lU083i Keterring to Figure 7, an example text array data structure 700 is 

illustrated. In an example embodiment, the media title array 402 and/or the 
primary artist array 404 (see Figure 4) may include the text array data structure 
700. In an example embodiment, the track title array 502 and/or the track 
primary artist name array 504 (see Figure 5) may include the text array data 
structure 700. la an example embodiment, the command array 602 and/or the 
provider name array 604 (see Figure 6) may include the text array data structure 
700. 

[0084] The example text array data structure 700 may include a first field 

with an official representation flag 702, a second field with display text 704, a 
third field with a written language identification (ID) 706, and a fourth field with 
a phonetic transcription array 708. 

[0085] The official representation flag 702 may provide a flag for the 

text array data structure 700 to indicate whether the text array data structure 700 
represents an official representation of the phonetic transcript (e.g., an official 
phonetic transcription) or an alternate representation of the phonetic transcript 
(e.g., an alternate phonetic transcription). For example, a flag may indicate that 
a titie or name is an official name. 

[0086] In an example embodiment, the official phonetic transcription 

may be a phonetic transcription of a correct pronunciation of a text string, hi an 
example embodiment, the alternate phonetic transcription may be a common 
mispronunciation or alternate pronunciation of a text stiing. The alternate 
phonetic ti-anscriptions may include phonetic ti-anscriptions of common non- 
standard pronunciation of a text stiing, such as may occur due to user error (e.g., 
incorrect pronunciation phonetic tianscription). The alternate phonetic 
h-anscriptions may also include phonetic ti-anscriptions of common non-standard 
pronunciation of a text string, occurring due to regional language, local dialect, 
local custom variances and/or general lack of clarity on correct pronunciation 
(e.g., the phonetic transcriptions of alternate pronunciations). 

[0087] In an example embodiment, the official representation may be 

generally associated with a text that appears on an officially released media 



21 



wo 2007/022533 



PCT/US2006/032722 



ana/or eaitonaiiy aecided. For example, an official artist name, an album title, 
and a track title may ordinarily be found on an original packaging of distributed 
media. In an example embodiment, the official representation may be a single 
normalized name, in case an artist has changed an official name during a career 
(e.g.. Price and John Mellencamp). 

[0088] In an example embodiment, the altemate representation may 

include a nickname, a short name, a common abbreviation, and the like, such as 
may be associated with an artist name, an album title, a track title, a genre name, 
an artist origin, and an artist era description. As described in greater detail 
below, each altemate representation may include a display text and optionally 
one or more phonetic transcriptions, hi an example embodiment, the phonetic 
transcription may be a textual display of a symboKzation of sounds occurring in 
a spoken human language. 

[0089] The display text 704 may indicate a text string that is suitable for 

display to a human reader. Examples of the display text 704 include display 
strings associated witii artist names, album titles, track titles, genre names, and 
the like. 

[0090] The written language ID 706 may optionally indicate an origin 

written language of the display text 704. By way of an example, the written 
language ID 706 may indicate that the display text of "Los Lonely Boys" is in 
Spanish. 

[0091] The phonetic transcription array 708 may include phonetic 

transcriptions in various spoken languages (e.g. American English, United 
Kingdom English, Canadian French, Spanish, and Japanese), Each language 
represented in the phonetic transcription array 708 may include an official 
pronunciation phonetic transcription and one or more altemate pronunciation 
phonetic transcriptions. 

[0092] Li an example embodiment, the phonetic transcription array 708 

or portions thereof may be stored as the phonetic metadata 128, 222 within the 
media database 126, 210. 

[0093] In an example embodiment, the phonetic transcriptions of the 
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pnoneuc iranscnpnon array 708 may be stored using an X-S AMPA alphabet. In 
an example embodiment, the phonetic transcriptions may be converted into 
another phonetic alphabet, such as L&H+. Support for a specific phonetic 
alphabet may be provided as part of a software library build configuration. 

[0094] The display text 704 may be associated with the official phonetic 

transcriptions and alternate phonetic transcriptions of the phonetic transcription 
array 708 by creating a dictionary, which may be provided and used by the 
speech recognition and synthesis apparatus 300 (see Figure 3) in advance of a 
recognition event, hi an example embodiment, the display text 704 and 
associated phonetic transcriptions may be provided on an occurrence of a 
recognition event. 

[0095] Phonetic transcriptions of alternate pronunciations, or phonetic 

variants, of most commonly mispronounced strings for the phonetic metadata 
128, 222 may be provided. The alternate pronunciations or phonetic variants 
may be used to accommodate the automated speech recognition engine 112 to 
handle many plaintext strings using Grapheme-to-Phoneme technology. 
However, recognition may be problematic on a few notable exceptions (such as 
artist names Sade, Beyonce, AC/DC, 311, B-52s, R.E.M., etc.). hi addition or 
instead, an embodiment may include phonetic variants for names commonly 
mispronounced by users. For example, artists like Sade (e.g., mispronounced 
/'seld/), Beyonce (e.g., mispronounced /bi.'jans/) and Brian Eno (e.g., 

mispronounced /•£._noo/). 

[0096] In an example embodiment, phonetic representations are provided 

of an alternate name that an artist could be called, thus lessening the rigidity 
usually found in ASR systems. For example, content can be edited such that the 
commands "Play Artist: Frank Sinatra," "Play Artist: OV Blue Eyes," "Play 
Artist: The Chairman of the Board" are all equivalent. 

[0097] By way of a series of examples, a first use case may be for the 

Beach Boys, which may have one phonetic transcription in English that says the 
"Beach Boys". A second use case (e.g., for a nickname) may be for Elvis 
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Presley, who has associated with his name a nickname, namely, "The King" or 
the "King of Rock and Roll". Each of the strings for the nickname may have a 
separate text array data structure 700 and have an official phonetic transcription 
within the phonetic transcription array 708 associated therewith. A third use 
case (e.g., for a multiple pronunciation) may be for the Eisley Brothers. The 
Eisley Brothers may have a single text array data structure 700 with a first 
official phonetic transcription for the Eisley Brothers and a second 
mispronunciation transcription for the Isley Brothers in the phonetic 
transcription array 708. 

[0098] Further with the foregoing example, a fourth use case (e.g., for 

multiple languages) may have an artist Los Lobos that has a phonetic 
transcription in Spanish. The phonetic metadata 128 in the media database 126 
may be stored in Spanish, the phonetic transcription may be stored in Spanish 
and tagged accordingly. A fifth use case (e.g., a foreign language in a nickname 
and a regionalized exception) may include a foreign language nickname, such as 
Elvis Presley's nickname of "Mao Wong" in China, The phonetic transcription 
for the nickname may be stored as Mao Wong and the phonetic transcription 
may be associated with the Chinese language. A sixth use case (e.g., 
mispronunciation regionalized exception) may be for ACDC. AC/DC may have 
an associated official transcription in English that is AC/DC, and a French 
transcription for ACDC that will be provided when the spoken language is 
French. 

[0099] Referring to Figure 8, an example phonetic transcription data 

structure 800 is illustrated. In an example embodiment, each element of the 
phonetic transcription array 708 (see Figure 7) may include the phonetic 
transcription data structure 800. For example, phonetic transcriptions may 
include the phonetic transcription data structure 800. 

[00100] The phonetic transcription data structure 800 may include a first 
field with a phonetic transcription string 802, a second field with a spoken 
language ID 804, a third field with an origin language transcription flag 806, and 
a fourth field with a correct pronunciation flag 808. 
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[00101] The phonetic transcription string 802 may include a text string of 
phonetic characters used for pronxmciation. For example, the phonetic 
transcription string 802 may be suitable for use by an ASR/TTS system. 

[00102] In an example embodiment, the phonetic transcription string 802 
may be stored in the media database 126 in a native spoken language (e.g., an 
origin language of the phonetic transcription string 802). 

[00103] Li an example embodiment, an alphabet used for the string of 

phonetic characters may be stored in a generic phonetic language (e.g., X- 
SAMPA) that maybe translated to ASR and/or TTS system specific character 
codes. In an example embodiment, an alphabet used for the string of phonetic 
characters may be L&H+. 

[00104] The spoken language ID 804 may optionally indicate an origin 
spoken language of the phonetic transcription string 802. For example, the 
spoken language ED 804 may indicate that the phonetic transcription string 802 
captures how a speaker of a language identified by the spoken language ID 804 
may utter an associated display text 704 (see Figure 7). 

[00105] The origin language transcription flag 806 may indicate if the 
transcription corresponds to the written language ID 706 of the display text 704 
(see Figure 7). In an example embodiment, the phonetic transcription may be in 
an origin language (e.g., a language in which the string would be spoken) when 
the phonetic transcription is in a same language as the display text 704. 

[00106] The correct pronunciation flag 808 may indicate whether the 

phonetic transcription string 802 represents a correct pronunciation in the spoken 
language identified by the spoken language ID 804. 

[00107] In an example embodiment, a correct pronunciation may be when 
a pronunciation it is generally accepted by speakers of a given language as being 
correct. Multiple correct pronunciations may exist for a single display text 704, 
where each such pronunciation represents the "correct" pronunciation in a given 
spoken language. For example, the correct pronunciation for "AC/DC" in 
English may have a different phonetic transcription (ay see dee see) fi-om the 
phonetic transcription for the correct pronunciation of "AC/DC" in French (ah 
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say deh say). 

[00108] In an example embodiment, a mispronunciation may be when a 
pronunciation it is generally accepted by speakers of a given language as being 
mispronounced. Multiple mispronunciations can exist for a single display text 
704, where each such pronunciation may represent the mispronunciation in a 
given spoken language. For example, the incorrect pronunciation phonetic 
transcriptions may be provided to an embedded application in the cases where 
the mispronunciations are common enough that then- utterance by users is 
relatively likely. 

[00109] In an example embodiment, to retrieve the phonetic transcriptions 
(e.g., for correct pronunciations and mispronunciations) in the target spoken 
language for a representation (e.g., an artist name, a media title, etc.), a phonetic 
transcription array 708 (see Figure 7) of a representation may be traversed, the 
target phonetic transcription strings 802 may be retrieved, and the correct 
pronunciation flag 808 of each phonetic transcription maybe queried. 

[00110] In an example embodiment, data from the media data structure 
400 including display text 704, the phonetic transcriptions of the phonetic 
transcription array 708, and optionally the spoken language IDs 804 may be used 
to populate the grammar 318 and the dictionaries 310 (and optionally other 
dictionaries) for the speech recognition and synthesis apparatus 300 (see Figure 
3). 

[001 11] Referring to Figure 9, an example altemate phrase mapper data 
structure 900 is illustrated. The altemate phrase mapper data structure 900 may 
include a first field with an altemate phrase 902, a second field with an ofBcial 
phrase array 904 and a third field with a phrase type 906. The altemate phrase 
mapper data stracture 900 may be used to support an altemate phrase mapper, 
the use of which is described in greater detail below. 

[001 12] The altemate phrase 902 may include an altemate phrase to an 
ofBcial phrase, where a phrase may refer to an artist name, a media or track title, 
a genre name, a description (of an artist type, artist origin, or artist era), and the 
like. The official phrase array 904 may include one or more official phrases 
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associated with the alternate phrase 902. 

[00113] For example, alternate phrases may include nicknames, short 
names, abbreviations, and the like that are commonly known to represent a 
person, album, song, genre, or era which has an official name. Contributor 
alternate names may include nicknames, short names, long names, birth names, 
acronyms, and initials. A genre alternate name may include "rhyflim and blues" 
where the official name is "R&B". Each artist name, album title, track title, 
genre name, and era description for example may potentially have one or more 
alternate representations (e.g., an alternate phonetic transcription for the alternate 
phrase) aside firom its official representation (e.g., an official phonetic 
transcription for the alternate phrase). 

[001 14] In an example embodiment, the phonetic transcription for the 
alternate phrase may be a phonetic transcription of a text string that represents an 
alternative name to refer to another name (e.g., a nickname, an abbreviation, or a 
birth name). 

[00115] hi an example embodiment, the alternate phrase mapper may use 
a separate database, whereupon each successful lookup the alternate phrase 
mapper database maybe automatically populated with the alternate phrase 
mapper data structures 900 mapping alternate phrases (if any exist in the 
retumed media data) to official phrases. 

[001 16] In an example embodiment, phonetic transcriptions for alternate 
phrases may be stored as dictionaries (e.g., a contributor phonetic dictionary 
and/or a genre phonetic dictionary) within the dictionary entry 320 of a speech 
recognition and synthesis apparatus 300 to enable a user to speak an alternate 
phrase as an input instead of the official phrase (see Figure 3). The use of the 
dictionaries may enable the ASR engine 3 14 to match a spoken input 116 to a 
correct display text 704 (see Figure 7) firom one of the dictionaries. The text 
command 316 firom the ASR engine 314 may then be provided for fiirther 
processing, such as to VOCs application layer 124 and/or playlist application 
layer 122 (see Figures 1 and 3). 

[00117] The phrase type 906 may include a type of the phrase, such as 
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may correspona to me media data structure 400 (see Figure 4). For example, 
values of the phrase type 906 may include an artist name, an album title, a track 
title, and a command. 

[00118] Referring to Figure 10, a method 1 000 for managing phonetic 
metadata 128, 222 on a database in accordance with an example embodiment is 
illustrated. In an example embodiment, the database may include the media 
database 126, 210 (see Figures 1 and 2). 

[001 19] The database may be accessed at block 1 002. At decision block 
1 004, a determination may be made as to whether the phonetic metadata 128, 
222 will be altered. If the phonetic metadata 128, 222 will be altered, the 
phonetic metadata 128, 222 is altered at block 1006. An example embodiment 
of altering the phonetic metadata 128, 222 is described in greater detail below. 
If the phonetic metadata 128, 222 will not be altered at decision block 1004 or 
after block 1006, the method 1000 may then proceed to decision block 1008. 

[00120] A determination may be made at decision block 1 008 as to 
whether metadata (e.g., phonetic metadata 128, 222 and/or media metadata 130, 
220) should be provided from the database. 

[00121] If the metadata is to be provided, the metadata is provided from 
the database at block 1010. In an example embodiment, providing the metadata 
may include providing requested metadata for the data to the local library 
database 118 (see Figure 1). 

[00122] Li an example embodiment, the phonetic metadata 128 for 
regional phonetic transcriptions may be provided from and/or to the database 
and may be stored in a native spoken language of a target region. 

[00123] In an example embodiment, providing the metadata at block 1010 
may include analyzing a music library of an embedded application to determine 
the accessible digital audio tracks and create a contributor/artist phonetic 
dictionary and a generic phonetic dictionary with the speech recognition and 
synthesis apparatus 300 (see Figure 3). For example, the phonetic metadata 
128, 222 for all associated spoken languages that may be supported for a given 
application may be received and stored for use by an embedded application at 
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DIOCK 101 U. 

[00124] If the metadata is not to be provided at decision block 1 008 or 
after block 1010, the method 1000 may proceed to decision block 1012 to 
determine whether to terminate. If the method 1000 is to continue operating, the 
method 1000 may return to decision block 1004; otherwise the method 1000 
may terminate. 

[00125] In an example embodiment, the metadata may be provided in 
real-time at block 1010 whenever a recognition event occurs, such as by 
interesting a CD in a device nmning the embedded appUcation, upload a file for 
access by the embedded, the command data for music navigation is acquired, 
and the Uke. In an example embodiment, providing phonetic metadata 128, 222 
dynamically may reduce search time for matching data within an embedded 
application. 

[00126] In an example embodiment, alternate phrase data used by an 
alternate phrase mapper may be provided in the same manner as the phonetic 
metadata 128, 222 at block 1010. For example, the alternate phrase data may 
automatically be a part of the media metadata 130, 220 that is returned by a 
successful lookup. 

[00127] Referring to Figure 11, a method 1 1 00 for altering phonetic 
metadata of a database in accordance with an example embodiment is illustrated. 
The method 1 100 may be performed at block 1002 (see Figure 10). In an 
example embodiment, the database may include the media database 126, 210 
(see Figures 1 and 2). A string may be accessed at block 1 102, such as from 
among a plurality of strings contained within the fields of the media metadata 
220. In an example embodiment, the string may describe an aspect of the media 
item 218 (see Figure 2). For example, the string maybe a representation of a 
media title of the media title array 402, a representation of a primary artist name 
of the primary artist name array 404, a representation of a track title of the track 
title array 502, a representation of a primary artist name of the track primary 
artist name array 504, a representation of a command of the command array 602, 
and/or a representation of a provider of the provider name array 604. 
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tUU128J At decision block 1 1 04, a determination may be made as to 
whether a written language m 706 (see Figure 7) should be assigned to the 
string. If the method 1 100 determines that the written language ID 706 of the 
string should be assigned, the written language ID 706 of the string may be 
assigned at block 1 1 06. By way of example, Celine Dion may be assigned the 
spoken language of Canadian French and Los Lobos may be assigned the spoken 
language of Spanish. 

[00129] In an example embodiment, the determination of associating a 

string with the written language ID 706 may be made by a content editor. For 
example, the determination of associating a string with a written language may 
be made by accessing available information regarding the string, such as from a 
media-related website (e.g., AllMusic.com and Wikipedia.com). 

[00130] If the method 1 1 00 determines that the written language of the 
string should not be assigned and/or reassigned (e.g., as the string already has a 
correct written language assigned) at decision block 1 1 04 or after block 1 1 06, 
the method 1 100 may proceed to decision block 1 108. 

[00131] Upon completion of the operation at block 1 106, the method 1 100 

may assign an official phonetic transcription to the string, such as through an 
automated source that uses processing to generate the phonetic transcription in 
the spoken language of the string. 

[00132] The method 1 1 00 at decision block 1 1 08 may determine whether 
an action should be taken with an official phonetic transcription for the string. 
For example, the official phonetic transcription may be retained with the 
phonetic transcription array 708 (see Figure 7). If an action should be taken 
within the official phonetic transcription for the string, the official phonetic 
transcription for the string may be created, modified and/or deleted at block 
1110. If the action should not be taken with the official phonetic transcription 
for the string at decision block 1 108 or after block 1 1 10, the method 1 100 may 
proceed to decision block 1112. 

[00133] At decision block 1 1 1 2, the method 1 1 00 may determine whether 
an action should be taken with one or more alternate phonetic transcriptions. For 
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example, one or more of the alternate phonetic transcriptions may be retained 
with the phonetic transcription array 708. If an action should be taken with the 
alternate phonetic transcription for the string, the alternate phonetic transcription 
for the string may be created, modified and/or deleted at block 1 1 14. If an 
action should not be taken with the official phonetic transcription for the string 
at decision block 1 1 12 or after block 1 1 14, the method 1 100 may proceed to 
decision block 1116. 

[00134] In an example embodiment, the alternate phonetic transcriptions 
may be created for non-origin languages of the string. 

[00135] In an example embodiment, alternate phonetic transcriptions are 
not created for each spoken language in which the string may be spoken. Rather, 
altemate phonetic transcriptions may be created for only the spoken languages in 
which the phonetic transcription would sound incorrect to a spealcer of the 
spoken language. 

[00136] The method 1 100 at decision block 1116 may determine whether 
further access is desired. For example, fiarther access may be provided to a 
current string and/or another string. If fiarther access is desired, the method 1 100 
may return to block 1 102. If fiirther access is not desired at decision block 1116, 
the method 1 100 may terminate. 

[00137] In an example embodiment, the phonetic transcriptions may 
undergo an editorial review in supported languages. For example, an English 
speaker may listen to the English phonetic transcriptions. When transcriptions 
are not stored in English, the English speaker may listen to the phonetic 
transcriptions stored in a non-English language and translated into English. The 
English speaker may identify phonetic transcriptions that need to be replaced, 
such as with a regionalized exception for the phonetic transcription. 

[00138] Referring to Figure 12, a method 1200 for using metadata with 

an application in accordance with an example embodiment is illustrated. In an 
example embodiment, the application may be an embedded application. 
Accordingly, the method 1200 may be deployed and integrated into any audio 
equipment such as mobile MP3 players, car audio systems, or the like. 
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[00139] 



Metadata (e.g., phonetic metadata 128, 222 and/or media 



metadata 130, 220) may be configured and accessed for the application at block 
1202 (see Figures 1-3). An example embodiment of configuring and accessing 
metadata for the application is described in greater detail below. 



metadata, the providing the phonetic metadata 128, 222 for a media item may be 
reproduced with speech synthesis. In an example embodiment, after configuring 
and accessing the metadata, the providing the phonetic metadata 128, 222 and/or 
media metadata 130, 220 may be provided to a third party device during access 
of the media item. 

[00141] The method 1200 may re-access and re-configure metadata at 
block 1202 based on the accessibility of additional media. 

[00142] At decision block 1 204, the method 1 200 may determine whether 

to invoke voice recognition. If the voice recognition is to be invoked, a 
command may be processed by the speech recognition and synthesis apparatus 
300 (see Figure 3) at block 1206. An example embodiment of a method for 
processing the command with voice recognition is described in greater detail 
below. If the voice recognition is not to be invoked at decision block 1204 or 
after block 1206, the method 1200 may proceed to decision block 1208. 

[00143] The method 1200 at decision block 1208 may determine whether 
to invoke speech synthesis. If speech synthesis is to be invoked, the method 
1200 may provide an output string through the speech recognition and synthesis 
apparatus 300 at block 1210. An example embodiment of a method for 
providing an output string by the speech recognition and synthesis apparatus 300 
is described in greater detail below. If speech synthesis is not to be invoked at 
decision block 1208 or after block 1210, the method 1200 may proceed to 
decision block 1214. 

[00144] At decision block 1 2 1 4, the method 1 200 may determine whether 

to terminate. If the method 1200 is to fiirther operate, the method 1200 may 
return to decision block 1204; otherwise, the method 1200 may terminate. 

[00145] Referring to Figure 13, a method 1300 for accessing and 



[00140] 



In an example embodiment, after configuring and accessing the 



32 



wo 2007/022533 



PCT/US20D6/032722 



conngunng metadata tor an application in accordance with an example 
embodiment is illustrated. In an example embodiment, the application may be 
the embedded application. The method 1300 may, for example, be performed at 
block 1202 (see Figure 12). 

[00146] At decision block 1302, the method 1300 may determine whether 

to access and configure music metadata and the associated phonetic metadata 
128, 222 (see Figures 1 and 2). If the music metadata and the associated 
phonetic metadata 128, 222 is to be accessed and configured, the method 1300 
may access and configure the music metadata and the associated phonetic 
metadata 128, 222 at block 1304. An example embodiment of configuring 
media metadata 130, 220 (e.g., music metadata) is described in greater detail 
below. If the music metadata and the associated phonetic metadata 128, 222 is 
not to be accessed and configured at decision block 1302 of after block 1304, the 
method 1300 may proceed to decision block 1306. 

[00147] The method 1300 at decision block 1306 may determine whether 
to access and configure navigation metadata and the associated phonetic 
metadata 128, 222. If the navigation metadata and the associated phonetic 
metadata 128, 222 is to be accessed and configured, the method 1300 may 
access and configure the navigation metadata and the associated phonetic 
metadata 128, 222 at block 1308. An example embodiment of configuring 
media metadata 130, 220 (e.g., navigation metadata) is described in greater 
detail below. If the navigation metadata and the associated phonetic metadata 
128, 222 is not to be accessed and configured at decision block 1306 of after 
block 1308, the method 1300 may proceed to decision block 1310. 

[00148] At decision block 1310, the method 1300 may determine whether 
to access and configure other metadata and the associated phonetic metadata 
128, 222. If the other metadata and the associated phonetic metadata 128, 222 is 
to be accessed and configured, the method 1300 may access and configure the 
other metadata and the associated phonetic metadata 128, 222 at block 1312. An 
example embodiment of configuring media metadata 130, 220 is described in 
greater detail below. If the other media metadata and the associated phonetic 
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metaaata 1:^8, is not to be accessed and configured at decision block 1310 of 
after block 1312, the method 1300 may proceed to decision block 1314. 

[00149] In an example embodiment, the other metadata may include 
playlisting metadata. For example, users may input their own pronunciation 
metadata for either a portion of the core metadata or for a voice command, as 
well as assign genre similarity, ratings, and other descriptive information based 
on their personal preferences at block 1312. Thus, a user may create his or her 
own genre, rename The Who as "My Favorite Band," or even set a new syntax 
for a voice command. Users could manually enter custom variants using a 
keyboard or scroll pad interface in the car or by speaking the variants by voice. 
An altemate solution may enable users to add custom phonetic variants by 
spelling them out aloud. 

[00150] The method 1 3 00 may determine whether ftirther access and 
configuration of the media metadata 130, 220 and associated phonetic metadata 
128, 222 is desired at decision block 1314. If further access and configuration is 
desired, the method may return to decision block 1302. If fiirther access and 
configuration is not desired at decision block 13 14, the method 1300 may 
terminate. 

[00151] Referring to Figure 14, a method 1400 for accessing and 
configuring media metadata for an application in accordance with an example 
embodiment is illustrated. In an example embodiment, the method 1400 may be 
performed at block 1304, block 1308 and/or block 1312 (see Figure 13). 

[00152] One or more media items (e.g., digital audio tracks, digital video 
segments, and navigation items) may be accessed from a media library at block 
1402. hi an example embodiment, the media library may be embodied within 
the media database 126, 210 (see Figures 1 and 2). In an example embodiment, 
the media library may be embodied within the local library database 118 (see 
Figures 1). 

[00153] The method 1400 may attempt recognition of the media items at 
block 1404. At decision block 1406, the method 1400 may determine whether 
the recognition was successful. If the recognition was successfiil, the method 
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14UU may access the media metadata 130, 220 and associated phonetic metadata 
128, 222 at block 1408 and conJBgure the media metadata 130, 220 and 
associated phonetic metadata 128, 222 at block 1410. If the recognition was not 
successful at decision block 1406 or after block 1410, the method 1400 may 
terminate. 

[00154] In an example embodiment, a device implementing the 
application operating the method 1400 may be used to control, navigate, playlist 
and/or link music service content which akeady may contains linked identifiers 
such as on-demand streaming, radio streaming stations, satellite radio, and the 
Hke. Once the content is successfully recognized at decision block 1406, the 
associated metadata and phonetic metadata 128, 222 may then be obtained at 
block 1408 and configured for the apparatus at block 1410. 

[00155] In the example music domain, some artists or groups may share 

the same name. For example, the 90's rock band Nirvana shares its name with a 
70's Christian folk group, and the 90's and OO's Cahfomia post-hardcore group 
Camera Obscura shares its name with a Glaswegian Indie pop group. 
Furthermore, some artists share nicknames with the real names of other artists. 
For example, Frank Sinatra is known as "The Chairman of the Board," which is 
also phonetically very similar to the name of a soul group fi:om the 70's called 
"The Chairmen of the Board". Further, ambiguity may result fi-om the rare 
occurrence that, for example, the user has both Camera Obscura bands on a 
portable music player (e.g., on hard drive of the player) and the user then 
instructs the apparatus to "Play Camera Obscura." 

[00156] Example methodology may be employed to accommodate 

duplicate names may be as follows. In an embodiment, selection of artist or 
album to play may be based upon previous playing behavior of a user or explicit 
input. For example, assume that the user said "Play Nirvana" having both Kurt 
Cobain's band and the 70's folk band on the user's playback device (e.g., 
portable MP3 player, personal computer, or the like). The apphcation may use 
playlisting technology to check both play fi-equency rates for each artist and play 
frequency rates for related genres. Thus, if the user frequently plays early-90's 
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grunge then the grunge Nirvana may be played; if the user frequently plays folk, 
then the folk Nirvana may be played. The apparatus may allow toggling or 
switching between a preferred and a non-preferred artist. For example, if the user 
wants to hear folk Nirvana and gets grunge Nirvana, the user can say "Play 
Other Nirvana" to switch to folk Nirvana. 

[00157] In addition or instead, the user may be prompted upon recognition 
of more than one match (e.g., more than one match per album identification). 
When, for example, the user says "Play artist Camera Obscura," the apparatus 
will find two entries and prompt (e.g., using TTS fimctionality) the user: "Are 
you looking for Camera Obscura from California, or Camera Obscura from 
Scotland?" or some other disambiguating question which uses other items in the 
media database. The user is then able to disambiguate the request themselves. It 
will be appreciated that when the apparatus is deployed in a navigation 
environment, town/city names, street names or the like may also be processed in 
a similar fashion. 

[00158] In an example embodiment, where an album series exists where 
each album has the same name other than a volume number (e.g., the "Vol. X"), 
any identical phonetic transcriptions may be treated as equivalent. Accordingly, 
when prompted, the apparatus may return a match on all targets. This 
embodiment may, for example, be applied to albums such as the "Now That's 
What I Call Music!" series. In this embodiment, the application may handle 
transcriptions such that if the user says "'Play Album' Now That's What I Call 
Music," all matching files found will play, whereas if the user says "'Play 
Album' Now That's What I Call Music Volume Five," only Volume Five will 
play. This functionality may also be applied to 2-Disc albums. For example, 
"Play Album "All Things Must Pass"" may automatically play tracks form both 
Disc 1 and Disc 2 of the two disc album. Alternatively, if the user says "Play 
Album "All Things Must Pass" Disc 2," only tracks from Disc 2 may be played. 

[00159] In an example embodiment, the device may accommodate custom 
variant entries on the user side in order to give meaning to terms like "My 
Favorite Band," "My Favorite Year," or "Mike's Surf-Rock Collection." For 
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example, the apparatus may allow "q)oken editing" (e.g., commanding the 
apparatus to "Call the Foo Fighters "My Favorite Band"). In addition or instead, 
text-based entry may be used to perform this functionality. As phonetic 
metadata 128, 222 may be a component of core metadata, a user may be able to 
edit entries on a computer and then upload them as some kind of tag with the 
file. Thus, in an embodiment, a user may efifectively add user defined 
commands not available with conventional physical touch interfaces. 

[00160] Referring to Figure 15, a method 1 500 for processing a phrase 
received by voice recognition in accordance with an example embodiment is 
illustrated. The method 1500 may be performed at block 1206 (see Figure 12). 

[00161] A phrase may be obtained at block 1 502. For example, the phrase 
may be received by spoken input 1 16 through the automated speech recognition 
engine 1 12 (see Figure 1). The phrase may then be converted to a text string at 
block 1504, such as by use of the automated speech recognition engine 1 12. 

[00162] The converted text string may then be identified with a media 
string at block 1506. An example embodiment of identifying the converted text 
string is described in greater detail below. 

[00163] In an example embodiment, a portion of the converted text string 
may be provided for identification, and the remaining portion may be retained 
and not provided for identification. For example, a first portion provided for 
identification may be a potential name of a media item, and second portion not 
provided for identification may be a command to an appUcation (e.g., "play Billy 
Idol" may have the first portion of "Billy Idol" and the second portion of 
"play"). 

[00164] At decision block 1 508, the method 1 500 may determine whether 
a media string was identified. If the media string was identified, the identified 
text string may be provided for use at block 1510. For example, the phrase may 
be returned to an application for its use, such that the string may be reproduced 
with speech synthesis. 

[00165] If a string was not identified, a non-identification process may be 
performed at block 1512. For example, the non-identification process may be to 
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taKe no action, respond with an error code, and/or make taking an intended 
action with a best guess of the string as the non-identification process. After 
completion of the operations at block 1510 or block 1512, the method 1500 may 
terminate. 

[00166] Figure 16 illustrates a method 1 600 for identifying a converted 
text string in accordance with an example embodiment. In an example 
embodiment, the method 1600 may be performed at block 1506 (see Figure 15). 

[00167] A converted text string may be matched with the display text 704 

of a media item at block 1602. At decision block 1604, the method 1600 may 
determine whether a match was identified. If no match was identified, an 
indication that no match was identified may be returned at block 1606. If a 
string match was identified at decision block 1604, the method 1600 may 
proceed to block 1608. 

[00168] The converted text string may be processed through an alternate 
phrase mapper at block 1 608. For example, the alternate phrase mapper may 
determine whether an alternate phrase exists (e.g., may be identified) for the 
converted text string. 

[00169] In an example embodiment, the alternate phrase mapper may be 
used to facilitate the mapping of alternate phrases to their associated official 
phrase. The alternate phrase mapper may be used within the speech recognition 
and synthesis apparatus 300 (see Figure 3), wherein an uttered alternate phrase 
leads to an official representation of display text 704. For example, if "The 
Stones" is provided as spoken input 1 14; the automated speech recognition 
engine 1 12 may analyze the phonetics of the uttered name and produce the 
defined display text 704 of "The Stones" (see Figures 1 and 7). "The Stones" 
may be submitted to the alternate phrase mapper, which would the return the 
official name "The Rolling Stones". 

[00170] In an example embodiment, the altemate phrase mapper may 
return multiple ofScial phrases in response to a single input altemate phrase 
since there may be more than one official phrase for the same altemate phrase. 

[00171] At decision block 1 61 0, the method 1 600 may determine whether 
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me alternate pnrase nas been identified. If the alternate phrase has not been 
identified, the string for the obtained phonetic transcription may be returned. If 
the alternated phrase has been identified at decision block 1610, a string 
associated with an official transcription may be returned. After completion of 
the operations at block 1612 or block 1614, the method 1600 may terminate. 

[00172] Referring to Figure 17, a method 1700 for providing an output 
string by speech synthesis m accordance with an example embodiment is 
illustrated. Li an example embodiment, the method 1 700 may be performed at 
block 1706 (see Figure 13). 

[001731 A string may be accessed at block 1702. For example, the 
accessed string may be a string for which speech synthesis is desired. A 
phonetic transcription may be accessed for the string at block 1 704. For 
example, a correct phonetic transcription for the spoken language corresponding 
to the string may be accessed. An example embodiment of accessing the 
phonetic transcription for the string is described in greater detail below. 

[00174] In an example, a phonetic transcription for a string may be 
unavailable, such as within the media database 126 and/or the local library 
database 1 18. An example embodiment for creating the phonetic transcription is 
described in greater detail below. 

[00175] The phonetic transcription may be outputted through speech 
synthesis in a language of an application at block 1706. For example, the 
phonetic transcription may be -outputted from the TTS engine 1 10 as the spoken 
output 1 14 (see Figure 1). After completion of the operation at block 1706, the 
method 1700 may terminate. 

[00176] Referring to Figure 18, a method 1 800 for accessing a phonetic 
transcription for a string in accordance with an example embodiment is 
illustrated. In an example embodiment, the method 1 800 may be performed at 
block 1704 (see Figure 18). 

[00177] A written language detection (e.g., detecting a written language) 
of a string and a spoken language detection of a target application (e.g., as may 
be embodied on a target device) may be performed at block 1802. In an example 
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embodiment, the string may be a representation of a media title of the media title 
array 402, a of a primary artist name of the primary artist name array 404, a 
representation of a track title of the track title array 502, a representation of a 
primary artist name of the track primary artist name array 504, a representation 
of a command of the command array 602, and/or a representation of a provider 
of the provider name array 604. La an example embodiment, the target 
application may be the embedded application. 

[00178] At decision block 1 804, the method 1 800 may determine whether 
a regional exception is available for the string. If the regional exception is 
available, a regional phonetic transcription associated with the string may be 
accessed at block 1806. In an example embodiment, the regional phonetic 
transcription may be an alternate phonetic transcription, such as may be due to a 
regional language, local dialect and/or local custom variances. 

[00179] Upon completion of block 1 806, the method 1 800 may proceed to 

decision block 1814. If the regionalized exception is not available for the string 
at decision block 1804, the method 1 800 may proceed to decision block 1 808. 

[00180] The method 1 800 may determine whether a transcription is 

available for the string at decision block 1808. If the transcription is available, 
the transcription associated with the string may be accessed at block 1810. 

[00181] In an example embodiment, the method 1 800 at block 1810 may 
first access a primary transcription that matches the string language when 
available, and when unavailable may access another available transcription (e.g., 
an English transcription). 

[00182] If the transcription is not available for the string at decision block 

1 808, the method 1 800 may programmatically generate a phonetic transcription 
at block 1812. For example, programmatically generating an alternate phonetic 
transcription for a regional mispronunciation in the native language of a speaker 
may use a default G2P already loaded into a device operating the application, 
such that the received text strings upon recognition of content may be run 
througli a default G2P. An example embodiment of programmatically 
generating a phonetic transcription is described in greater detail below. Upon 
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compienon or tne operations at block 1810 and 1812, the method 1800 may 
proceed to decision block 1814. 

[001831 At decision block 1 8 1 4, the method 1 800 may detennine whether 
the written language of the string matches the spoken language of the target 
application. If the written language of the string does not match the spoken 
language of the target application, the obtained phonetic transcription may be 
converted into the spoken language of the target application (e.g., the target 
language) at block 1816. An example embodiment for a method of converting 
the obtained phonetic transcription is described in greater detail below. 

[00184] In an example embodiment, phonetic transcriptions at block 1816 
may be converted from a native spoken language of the string to a target 
language of an application operating on the device using phoneme conversion 
maps. 

[00185] If the written language of the string matches the spoken language 
of the target application at decision block 1814 or after block 1816, the phonetic 
transcription for the string may be provided to the application at block 1818. 
After completion of the operation at block 1818, the method 1 800 may 
terminate. 

[00186] In an example embodiment, the method 1 800 before conducting 
the operation at block 1818 may perform a phonetic alphabet conversion to 
convert the phonetic transcription into a transcription usable by the device. In an 
example embodiment, the phonetic alphabet conversion may be performed after 
the phonetic transcription for the string is provided. 

[00187] Referring to Figure 19, a method 1900 for programmatically 

generatmg the phonetic transcription is illustrated. In an example embodiment, 
the method 1900 may be performed at block 1812 (see Figure 18). 

[00188] At decision block 1902, the method 1900 may determine whether 
a text string includes a written language ID 706 (see Figure 7). If the string 
includes the written language ID 706, the method 1900 may programmatically 
generate a phonetic transcription for a regional mispronunciation in a spoken 
language of an appKcation using G2P at block 1904. 
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[00189] If the text string does not include the written language ID 706 at 
decision block 1902, a phonetic transcription in a written language of the text 
string may be generated at block 1906. For example, a language-specific G2P 
may be used by the speech recognition and synthesis apparatus 300 (see Figure 
3) to generate a phonetic transcription in the written language of the text string. 

[00190] A phoneme conversion map may be used at block 1 908 to convert 
the phonetic transcription in the written language of the text string to one or 
more phonetic transcriptions respectively for one or more target spoken 
languages of an application. 

[00191] In an example embodiment, conversions of the phonetic 
transcriptions may be firom a single phonetic transcription to multiple phonetic 
transcriptions. 

[00192] After completion the operation at block 1904 or block 1910, the 
method 1900 may provide the phonetic transcription to the application. Upon 
completion of the operation at block 1920, the method 1900 may terminate. 

[00193] Referring to Figure 20, a method 2000 for performing phoneme 

conversion is illustrated. In an example embodiment, the method 2000 may be 
performed at block 1816 (see Figure 18). 

[00194] A spoken language ID 804 (see Figure 8) of an application (e.g., 

the embedded application) may be accessed at block 2002. In an example 
embodiment, the spoken language ID 804 of the application may be pre-set. In 
an example embodiment, the spoken language ID 804 of the application may be 
modifiable, such that a language of the embedded application may be selected. 

[00195] A phonetic transcript may be accessed at block 2004, and 
thereafter a written language ID 706 (see Figure 7) for the phonetic transcript 
may be accessed at block 2006. 

[00196] At decision block 2008, the method 2000 may determine whether 

the spoken language ID 804 of the embedded application matches the written 
language ID 706 of the phonetic transcript. If there is not a match, the method 
2000 may convert the phonetic transcript firom the written language to the 
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spoken language at block 2010. If the spoken language ID 804 does not match 
the written language ID 706 at decision block or after block 2010, the method 
2000 may terminate. 

[00197] Referring to Figure 21, a method 2 1 00 for converting a phonetic 
transcription into a target language in accordance with an example embodiment 
is illustrated. In an example embodiment, the method 2100 may be performed at 
block 2010 (see Figure 20). 

[00198] A language of an embedded application (e.g., a target application) 
that will utilize a target phonetic transcription may be determined at block 2102. 
A phonetic language conversion map may be accessed for a source phonetic 
transcription at block 21 04. In an example embodiment, phonetic language 
conversion map may be a phoneme conversion map. 

[00199] The source phonetic transcription may be converted into the 
target phonetic transcription using the phonetic conversion map at block 2106. 
After completion of the operation at block 2106, the method 2100 may 
terminate. 

[00200] In an example embodiment, a character mapping between a 
generic phonetic language and a phonetic language used by the speech 
recognition and synthesis apparatus 300 (see Figure 3) may be created and used 
with the media management system 106. Upon completion of the operation at 
block 2106, the method 2100 may tenninate. 

[00201] Figure 22 shows a diagrammatic representation of machine in the 
exemplary form of a computer system 2200 within which a set of instructions, 
for causing the machine to perform any one or more of the methodologies 
discussed herein, may be executed. In alternative embodiments, the machine 
operates as a standalone device or may be connected (e.g., networked) to other 
machines. In a networked deployment, the machine may operate in the capacity 
of a server or a client machine in server-client network environment, or as a peer 
machine in a peer-to-peer (or distributed) network environment. The machine 
may be a personal computer (PC), a tablet PC, a set-top box (STB), a Personal 
Digital Assistant (PDA), a cellular telephone, a portable music player (e.g., a 
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portable hard dnve audio device such as an MPS player), a car audio device, a 
web appliance, a network router, switch or bridge, or any machine capable of 
executing a set of instructions (sequential or otherwise) that specify actions to be 
taken by that machine. Further, while only a single machine is illustrated, the 
temi "machine" shall also be taken to include any collection of machines that 
individually or jointly execute a set (or multiple sets) of instructions to perform 
any one or more of the methodologies discussed herein. 

[00202] The exemplary computer system 2200 includes a processor 2202 
(e.g., a central processing unit (CPU) a graphics processmg unit (GPU) or both), 
a main memory 2204 and a static memory 2206, which communicate with each 
other via a bus 2208. The computer system 2200 may further include a video 
display unit 2210 (e.g., a Hquid crystal display (LCD) or a cathode ray tube 
(CRT)). The computer system 2200 also includes an alphanumeric input device 
2212 (e.g., a keyboard), a cursor control device 2214 (e.g., a mouse), a disk drive 
unit 2216, a signal generation device 2218 (e.g., a speaker) and a network 
interface device 2230. 

[00203] The disk drive unit 22 1 6 includes a machine-readable medium 
2222 on which is stored one or more sets of instructions (e.g., software 2224) 
embodying any one or more of the methodologies or functions described herein. 
The software 2224 may also reside, completely or at least partially, within the 
main memory 2204 and/or within the processor 2202 during execution thereof 
by the computer system 2200, the main memory 2204 and the processor 2202 
also constituting machine-readable media. 

[00204] The software 2224 may further be transmitted or received over a 
network 2226 via the network interface device 2230. 

[00205] While the machine-readable medium 2222 is shown in an 
exemplary embodiment to be a single medium, the term "machine-readable 
medium" should be taken to include a single medium or multiple media (e.g., a 
centralized or distributed database, and/or associated caches and servers) that 
store the one or more sets of instructions. The term "machine-readable medium" 
shall also be taken to include any medium that is capable of storing, encoding or 
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carrying a set ot instructions for execution by the machine and that cause the 
machine to perform any one or more of the methodologies of the present 
invention. The term "machine-readable medium" shall accordingly be taken to 
include, but not be limited to, soUd-state memories, optical and magnetic media, 
and carrier wave signals. 

[00206] The embodiments described herein may be implemented in an 
operating environment comprising software installed on a computer, in 
hardware, or in a combination of software and hardware. 

[00207] Although the present invention has been described with reference 
to specific example embodiments, it will be evident that various modifications 
and changes may be made to these embodiments without departing fi-om the 
broader spurit and scope of the invention. Accordmgly, the specification and 
drawings are to be regarded in an illusti-ative rather tiian a restiictive sense. 

[00208] The Absti:act of the Disclosure is provided to comply with 37 
C.F.R. § 1.72(b), requiring an absti-act that will allow tiie reader to quickly 
ascertain the nature of the technical disclosure. It is submitted with the 
understanding that it will not be used to interpret or limit the scope or meaning 
of the claims, hi addition, in the foregomg Detailed Description, it can be seen 
that various featijres are grouped together in a single embodiment for the 
purpose of sti-eamlining the disclosure. This metiiod of disclosure is not to be 
interpreted as reflecting an intention tiiat tiie claimed embodknents require more 
features tiian are expressly recited in each claim. Rather, as the following claims 
reflect, inventive subject matter lies in less than all featiires of a single disclosed 
embodiment. Thus tiie following claims are hereby incorporated into tiie 
Detailed Description, witii each claim standing on its own as a separate 
embodiment. 
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CLAIMS 

1 . An apparatus comprising: 

media metadata for a plurality of media items, the media metadata 
comprising a plurality of strings, wherein each string describes an aspect of the 
media items; and 

phonetic metadata associated with the plurality of strings, each portion of 
the phonetic metadata stored in an origin language of the string. 

2. The apparatus of claim 1, wherein media items are selected from at least 
one of compact discs, digital audio tracks, digital versatile discs, movies, or 
photographs. 

3 . The apparatus of claim 1 , wherein the aspect of the media items are 
selected from at least one of a media title, a primary artist name, a track title, a 
command, or a provider. 

4. The apparatus of claim 4, wherein the origin language of the string 
includes a language in which the string would be spoken. 

5. An apparatus with memory to store a data structure comprising: 

a first field comprising a display text, the display text comprising text 
suitable for display; and 

a second field comprising an official phonetic transcription of the display 
text stored in a source language of the display text. 

6. The apparatus of claim 5, wherein the second field fijrther comprises one 
or more alternate phonetic transcriptions of the display text. 
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7. The apparatus of claim 6, wherein the one or more alternate phonetic 
transcriptions of the display text comprises: 

at least one of one or more correct pronunciation phonetic transcriptions 
or one or more incorrect pronunciation phonetic transcriptions. 

8. The apparatus of claim 5 further comprising: 

a written language identification (ID) indicating an origin written 
language of the display text. 

9. The apparatus of claim 5 further comprising: 

an official representation flag to indicate whether the display text is an 
official representation or an alternate representation. 

10. The apparatus of claim 9, wherein the official representation is at least 
one of text that appears on an officially released media or editorially decided, 
and the alternate representation is at least one of a nickname, a short name, or a 
common abbreviation. 



1 1 . The apparatus of claim 9, further comprising an origin language 
transcription flag associated with each phonetic transcription of the second field, 
wherein the origin language transcription flag indicates if the phonetic 
transcription corresponds to the written language ID. 

12. The apparatus of claim 5, further comprising a correct pronunciation flag 
associated with each phonetic transcription of the second field, wherein the 
correct pronunciation flag indicates if the phonetic transcription is a correct 
pronunciation or a mispronunciation of the display text. 
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13. The apparatus of claim 5, wherein the display text is selected from at 
least one of a media title, a primary artist, a track title, a track primary artist 
name, a command array, or a provider, 

14. A method comprising: 

accessing a plurality of strings of media metadata; and 
creating at least one official phonetic transcript for each of the pl\irality 
of strings in an origin language of each string. 

1 5 . The method of claim 1 4, further comprising: 

assigning a spoken language identification (ED) to each of the plurality of 
strings, the spoken language ID indicating an origin language of each of the 
plurality of strings. 

1 6. The method of claim 14, wherein the plurality of strings are each a 
representation of display text, the method fiirther comprising: 

selecting at least one of a media title, a primary artist, a track title, a track 
primary artist name, a command array, or a provider as the display text. 

1 7. The method of claim 15, further comprising: 

creating at least one alternate phonetic transcript for at least a portion of 
the plurality of strings in a non-origin language of each string. 

18. A method comprising: 

recognizing a media item with a digital fingerprint to obtain metadata for 
the media item; and 
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accessing media metadata and associated phonetic metadata for the 
media item, the phonetic metadata comprising at least one phonetic transcription 
in an origin language of the media item. 

1 9. The method of claim 1 8, further comprising: 

configuring the media metadata and the associated phonetic metadata for 
an application. 

20. The method of claim 1 8, further comprising: 

selecting at least one of music metadata, playhsting metadata or 
navigation metadata as the media metadata. 

2 1 . The method of claim 1 8, further comprising: 

providing the associated phonetic metadata to a device during access of 
the media item. 

22. The method of claim 1 8, further comprising: 

reproducing the associated phonetic metadata with speech synthesis 
during access of the media item. 

23. A method comprising: 

matching a converted text string with a media item; 

processing the converted text through an alternate phrase mapper to 
identify a string associated with an official phonetic transcription for the 
converted text string of the media item; and 
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24. The method of claim 23 further comprising: 

providing the string associated with an official phonetic transcription for 
the media item for use by an application. 

25. The method of claim 24 further comprising: 

processing a command using the string associated with an official 
phonetic transcription on a device running the application. 

26. The method of claim 23 further comprising: 
obtaining a phrase; and 

converting the phrase to a converted text string with speech recognition. 

27. A method comprising: 

detecting a spoken language of a string and a target apphcation; 
accessing a phonetic transcription associated with the string; and 
providmg the phonetic transcription associated with the string in the 
spoken language of the target apphcation. 

28. The method of claim 27 further comprising: 

reproducing the phonetic transcription of the string through speech 
synthesis. 

29. The method of claim 27 further comprising: 

accessing a string, wherein the string comprises display text of at least 
one of a media title, a primary artist, a track title, a track primary artist name, a 
command array, or a provider. 
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30. The method of claim 27, wherein accessing a phonetic transcription 
associated with the string comprises: 

accessing a regionalized phonetic transcription associated with the string 
when a regionalized exception is available for the spoken language of the target 
application. 

3 1 . The method of claim 27 further comprising: 

generating a phonetic transcription for the string in the spoken language 
of the target application using G2P. 

32. The method of claim 27 fiirther comprising: 

generating a phonetic transcription for the string in the spoken language 
of the string; and 

converting the phonetic transcription into the spoken language of the 
target application using a phoneme conversion map. 

33. The method of claim 27 fiarther comprising: 

converting the phonetic transcription into the spoken language of the 
target application, 

34. The method of claim 27 further comprising: 

accessing a phonetic language conversion map for the phonetic 
transcription; and 

converting the phonetic transcription into a language of the appHcation 
using the phonetic language conversion map. 
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35. The method of claim 27 further comprising: 

reproducing the phonetic transcription with an embedded application of a 
playback device. 

36. A machine-readable medium comprising instructions, which when 
executed by a machine, cause the machine to: 

access a plurality of strings of media metadata; and 
create at least one official phonetic transcript for each of the plurality of 
strings in an origin language of each string. 

37. The machine-readable medium of claim 36, further comprising 
instructions, which when executed by a machine, cause the machine to: 

create at least one alternate phonetic transcript for at least a portion of the 
plurality of strings in a non-origin language of each string. 

38. A machine-readable medium comprising instructions, which when 
executed by a machine, cause the machine to: 

match a converted text string with a media item; 

process the converted text through an alternate phrase mapper to identify 
a string associated with an official phonetic transcription for the converted text 
string of the media item; and 

process the string associated with then official phonetic transcription 
with speech synthesis. 

39. A machine-readable medium comprising instructions, which when 
executed by a machine, cause the machine to: 

perform a spoken language detection of a string and a target application; 
access a phonetic transcription associated with the string; and 
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reproduce the phonetic transcription associated with the string in the 
spoken language of the target application through speech synthesis. 

40. The apparatus comprising; 

means for accessing a plurality of strings of media metadata; and 
means for creating at least one official phonetic transcript for each of the 
plurality of strings in an origin language of each string, 

41 . The apparatus of claim 40 further comprising: 

means for creating at least one alternate phonetic transcript for at least a 
portion of the pluraUty of strings in a non-origin language of each string. 
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